In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

1. Read the dataset.

In [2]:
data = pd.read_csv('pubg_dataset.csv')
data.head()
Out[2]:
Id groupId matchId assists boosts damageDealt DBNOs headshotKills heals killPlace ... revives rideDistance roadKills swimDistance teamKills vehicleDestroys walkDistance weaponsAcquired winPoints winPlacePerc
0 2f262dd9795e60 78437bcd91d40e d5db3a49eb2955 0 0 0.0 0 0 0 92 ... 0 0.0 0 0.0 0 0 0.0 0 1470 0.0000
1 a32847cf5bf34b 85b7ce5a12e10b 65223f05c7fdb4 0 0 163.2 1 1 0 42 ... 0 0.0 0 0.0 0 0 132.7 2 1531 0.2222
2 1b1900a9990396 edf80d6523380a 1cadec4534f30a 0 3 278.7 2 1 8 16 ... 3 0.0 0 0.0 0 0 3591.0 10 0 0.8571
3 f589dd03b60bf2 804ab5e5585558 c4a5676dc91604 0 0 191.9 1 0 0 31 ... 0 0.0 0 0.0 0 0 332.7 3 0 0.3462
4 c23c4cc5b78b35 b3e2cd169ed920 cd595700a01bfa 0 0 100.0 1 0 0 87 ... 0 0.0 0 0.0 0 0 252.7 3 1557 0.0690

5 rows × 29 columns

2. Check the datatype of all the columns.

In [3]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 29 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Id               10000 non-null  object 
 1   groupId          10000 non-null  object 
 2   matchId          10000 non-null  object 
 3   assists          10000 non-null  int64  
 4   boosts           10000 non-null  int64  
 5   damageDealt      10000 non-null  float64
 6   DBNOs            10000 non-null  int64  
 7   headshotKills    10000 non-null  int64  
 8   heals            10000 non-null  int64  
 9   killPlace        10000 non-null  int64  
 10  killPoints       10000 non-null  int64  
 11  kills            10000 non-null  int64  
 12  killStreaks      10000 non-null  int64  
 13  longestKill      10000 non-null  float64
 14  matchDuration    10000 non-null  int64  
 15  matchType        10000 non-null  object 
 16  maxPlace         10000 non-null  int64  
 17  numGroups        10000 non-null  int64  
 18  rankPoints       10000 non-null  int64  
 19  revives          10000 non-null  int64  
 20  rideDistance     10000 non-null  float64
 21  roadKills        10000 non-null  int64  
 22  swimDistance     10000 non-null  float64
 23  teamKills        10000 non-null  int64  
 24  vehicleDestroys  10000 non-null  int64  
 25  walkDistance     10000 non-null  float64
 26  weaponsAcquired  10000 non-null  int64  
 27  winPoints        10000 non-null  int64  
 28  winPlacePerc     10000 non-null  float64
dtypes: float64(6), int64(19), object(4)
memory usage: 2.2+ MB

3. Find the summary of all the numerical columns and write your findings about it.

In [4]:
data.describe()
Out[4]:
assists boosts damageDealt DBNOs headshotKills heals killPlace killPoints kills killStreaks ... revives rideDistance roadKills swimDistance teamKills vehicleDestroys walkDistance weaponsAcquired winPoints winPlacePerc
count 10000.000000 10000.000000 10000.000000 10000.00000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 ... 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.00000 10000.0000 10000.000000
mean 0.234600 1.088500 129.211264 0.64400 0.221700 1.354000 47.663100 506.970200 0.913400 0.543800 ... 0.160200 600.693584 0.004200 4.385917 0.024400 0.007700 1130.008410 3.63590 609.3440 0.469926
std 0.575149 1.703279 167.193945 1.09562 0.577046 2.629102 27.424146 627.297959 1.524117 0.701948 ... 0.454045 1524.915601 0.074719 30.889620 0.171486 0.089674 1168.597983 2.42209 739.7924 0.304508
min 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.0000 0.000000
25% 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 24.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 151.575000 2.00000 0.0000 0.200000
50% 0.000000 0.000000 83.805000 0.00000 0.000000 0.000000 48.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 650.350000 3.00000 0.0000 0.458300
75% 0.000000 2.000000 185.325000 1.00000 0.000000 2.000000 71.000000 1169.000000 1.000000 1.000000 ... 0.000000 0.000575 0.000000 0.000000 0.000000 0.000000 1923.250000 5.00000 1495.0000 0.735100
max 7.000000 18.000000 3469.000000 11.00000 14.000000 31.000000 100.000000 1926.000000 35.000000 4.000000 ... 5.000000 28780.000000 3.000000 971.200000 3.000000 2.000000 10490.000000 41.00000 1863.0000 1.000000

8 rows × 25 columns
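
One way to turn the summary table into findings is to compare each column's mean with its median (the 50% row): a mean well above the median points to a right-skewed column. In the table above, for example, kills has a mean of about 0.91, a median of 0 and a maximum of 35. The sketch below shows the comparison on a small made-up frame; the name `toy` and its values are illustrative only, and with the real data the same calls would be applied to `data`.

```python
import pandas as pd

# Illustrative stand-in for the real dataframe; the values are made up.
toy = pd.DataFrame({
    "kills": [0, 0, 0, 1, 1, 2, 9],
    "walkDistance": [0.0, 50.0, 150.0, 650.0, 1900.0, 2500.0, 9000.0],
})

# Transpose so each column becomes a row, which is easier to scan.
summary = toy.describe().T[["mean", "50%", "max"]]

# A mean well above the median (50%) suggests a long right tail.
summary["mean_minus_median"] = summary["mean"] - summary["50%"]
print(summary)
```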

4. How many players does the average person kill?

In [5]:
print("The average person kills",data['kills'].mean(),"players.")
The average person kills 0.9134 players.

5. 99% of people have at most how many kills?

In [6]:
print("99% of people have",np.percentile(data['kills'],99),"kills or fewer.")
99% of people have 7.0 kills or fewer.
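
The 99th percentile is the value at or below which 99% of the observations fall, so the result is best read as an upper bound on kills for 99% of players. A minimal sketch on made-up data makes this concrete; the `kills` array is a hypothetical stand-in for `data['kills']`.

```python
import numpy as np

# Hypothetical kill counts: 99 players with 0 to 2 kills and one with 20.
kills = np.array([0] * 60 + [1] * 30 + [2] * 9 + [20])

p99 = np.percentile(kills, 99)

# Share of players whose kill count is at or below the 99th percentile.
share_at_or_below = (kills <= p99).mean()
print(p99, share_at_or_below)
```

By default `np.percentile` interpolates linearly between neighbouring values, so the result can be a non-integer even for integer counts.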

6. What is the highest number of kills ever recorded?

In [7]:
print("The most kills ever recorded:",data['kills'].max())
The most kills ever recorded: 35

7. Print all the columns of the dataframe.

In [8]:
data.columns
Out[8]:
Index(['Id', 'groupId', 'matchId', 'assists', 'boosts', 'damageDealt', 'DBNOs',
       'headshotKills', 'heals', 'killPlace', 'killPoints', 'kills',
       'killStreaks', 'longestKill', 'matchDuration', 'matchType', 'maxPlace',
       'numGroups', 'rankPoints', 'revives', 'rideDistance', 'roadKills',
       'swimDistance', 'teamKills', 'vehicleDestroys', 'walkDistance',
       'weaponsAcquired', 'winPoints', 'winPlacePerc'],
      dtype='object')

8. Comment on distribution of the match's duration. Use seaborn.

In [9]:
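# distplot is deprecated in seaborn; histplot with a KDE overlay gives a comparable plot
sns.histplot(data['matchDuration'], kde=True, stat='density');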

print("Match durations are most densely concentrated between about 1250 and 1500.")
Match durations are most densely concentrated between about 1250 and 1500.

9. Comment on distribution of the walk distance. Use seaborn.

In [10]:
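# distplot is deprecated in seaborn; histplot with a KDE overlay gives a comparable plot
sns.histplot(data['walkDistance'], kde=True, stat='density');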

10. Plot distribution of the match's duration vs walk distance one below the other.

In [11]:
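sns.set_theme()  # seaborn styling; the 'seaborn' style name was deprecated in Matplotlib 3.6
fig, axes = plt.subplots(2, 1)
# histplot draws the distribution; plt.plot would only trace raw values against the row index
sns.histplot(data['matchDuration'], kde=True, ax=axes[0])
sns.histplot(data['walkDistance'], kde=True, ax=axes[1])
fig.tight_layout();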

11. Plot distribution of the match's duration vs walk distance side by side.

In [12]:
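fig, axes = plt.subplots(1, 2, figsize=(10, 3))
# histplot draws the distribution; plt.plot would only trace raw values against the row index
sns.histplot(data['matchDuration'], kde=True, ax=axes[0])
sns.histplot(data['walkDistance'], kde=True, ax=axes[1])
fig.tight_layout();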

12. Pairplot the dataframe. Comment on kills vs damageDealt and on maxPlace vs numGroups.

In [13]:
sns.pairplot(data);
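
With 25 numeric columns the full pairplot is a 25 by 25 grid, which makes the two relationships the question asks about hard to read. A lighter option is to restrict the plot to the columns of interest, e.g. `sns.pairplot(data[['kills', 'damageDealt', 'maxPlace', 'numGroups']])`, and to back the visual impression with a correlation matrix. The sketch below shows the correlation step on a made-up frame; `toy` and its values are illustrative only and say nothing about the real dataset.

```python
import pandas as pd

# Illustrative stand-in; the values are invented for the example.
toy = pd.DataFrame({
    "kills": [0, 1, 2, 3, 5],
    "damageDealt": [0.0, 100.0, 210.0, 290.0, 520.0],
    "maxPlace": [28, 48, 50, 96, 97],
    "numGroups": [27, 46, 49, 92, 95],
})

# Pearson correlation between every pair of the selected columns.
corr = toy.corr()
print(corr.loc["kills", "damageDealt"])
print(corr.loc["maxPlace", "numGroups"])
```

A coefficient close to 1 would support describing the pair as strongly and positively related; the scatter panel then shows whether the relation is roughly linear.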

13. How many unique values are there in 'matchType' and what are their counts?

In [14]:
data['matchType'].value_counts()
Out[14]:
squad-fpp           3969
duo-fpp             2282
squad               1359
solo-fpp            1234
duo                  702
solo                 386
normal-squad-fpp      24
normal-duo-fpp        13
crashfpp              13
normal-solo-fpp        8
normal-squad           4
flaretpp               3
crashtpp               2
flarefpp               1
Name: matchType, dtype: int64
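
`value_counts()` answers the second half of the question; the number of distinct values is the length of its result (14 rows in the output above). A short sketch of getting that number directly, along with each value's share of the rows, on a hypothetical series standing in for `data['matchType']`:

```python
import pandas as pd

# Hypothetical match types, standing in for data['matchType'].
match_type = pd.Series(
    ["squad-fpp", "squad-fpp", "duo-fpp", "solo", "squad-fpp", "duo-fpp"]
)

n_unique = match_type.nunique()                   # number of distinct values
shares = match_type.value_counts(normalize=True)  # fraction of rows per value
print(n_unique)
print(shares)
```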

14. Plot a barplot of ‘matchType’ vs ‘killPoints’. Write your inferences.

In [15]:
sns.barplot(x='matchType',y='killPoints',data=data);
plt.xticks(rotation=70);

15. Plot a barplot of ‘matchType’ vs ‘weaponsAcquired’. Write your inferences.

In [16]:
sns.barplot(x='matchType',y='weaponsAcquired',data=data);
plt.xticks(rotation=70);
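
By default `sns.barplot` draws the mean of the y variable for each category, with an error bar for the uncertainty of that mean. The same numbers can be read off with a `groupby`, which also exposes the group sizes. That matters for the inferences here, because several match types have only a handful of rows (flarefpp has 1), so their bars are unreliable. A sketch on made-up data, where `toy` and its values are illustrative:

```python
import pandas as pd

# Illustrative stand-in; the values are invented for the example.
toy = pd.DataFrame({
    "matchType": ["solo", "solo", "duo", "duo", "duo", "flarefpp"],
    "killPoints": [0, 1200, 1000, 0, 1100, 0],
    "weaponsAcquired": [3, 5, 2, 4, 6, 1],
})

# Mean per match type (the bar heights) plus the number of rows behind each mean.
per_type = toy.groupby("matchType")[["killPoints", "weaponsAcquired"]].agg(
    ["mean", "count"]
)
print(per_type)
```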

16. Find the Categorical columns.

In [17]:
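# read_csv stores strings as object dtype, so selecting only 'category' returns nothing.
# Id, groupId and matchId are identifiers; matchType is the categorical feature.
data.select_dtypes(include=['object', 'category']).columns
Out[17]:
Index(['Id', 'groupId', 'matchId', 'matchType'], dtype='object')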
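
`select_dtypes(['category'])` matches only columns whose dtype is pandas' dedicated `category` dtype, and string columns loaded by `read_csv` do not get that dtype automatically. A minimal sketch of converting a column explicitly, using a made-up frame (`toy` is illustrative):

```python
import pandas as pd

# Illustrative stand-in; the values are invented for the example.
toy = pd.DataFrame({
    "matchType": ["solo", "duo", "solo"],
    "kills": [0, 2, 1],
})

# Before conversion no column has the category dtype.
before = toy.select_dtypes(include=["category"]).columns.tolist()

# Convert the string column to the dedicated categorical dtype.
toy["matchType"] = toy["matchType"].astype("category")
after = toy.select_dtypes(include=["category"]).columns.tolist()
print(before, after)
```

Converting a low-cardinality string column this way also tends to reduce memory use, since each distinct label is stored once.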

17. Plot a boxplot of ‘matchType’ vs ‘winPlacePerc’. Write your inferences.

In [18]:
sns.boxplot(x='matchType',y='winPlacePerc',data=data);
plt.xticks(rotation=70);
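
The box in a boxplot spans the 25th to 75th percentile and the line inside it marks the median, so the inferences can be backed with per-group quartiles. A sketch on made-up data, where `toy` and its values are illustrative:

```python
import pandas as pd

# Illustrative stand-in; the values are invented for the example.
toy = pd.DataFrame({
    "matchType": ["solo"] * 5 + ["squad"] * 5,
    "winPlacePerc": [0.0, 0.25, 0.5, 0.75, 1.0, 0.1, 0.2, 0.3, 0.4, 0.5],
})

# Quartiles per match type: the box edges (0.25, 0.75) and the centre line (0.5).
quartiles = (
    toy.groupby("matchType")["winPlacePerc"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```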

18. Plot a boxplot of ‘matchType’ vs ‘matchDuration’. Write your inferences.

In [19]:
sns.boxplot(x='matchType',y='matchDuration',data=data);
plt.xticks(rotation=70);

19. Change the orientation of the above plot to horizontal.

In [20]:
# Swapping x and y makes the boxes horizontal; the numeric x ticks need no rotation
sns.boxplot(x='matchDuration',y='matchType',data=data);

20. Add a new column called ‘KILL’ that contains the sum of the following columns: headshotKills, teamKills, roadKills.

In [21]:
data['KILL'] = data['headshotKills'] + data['teamKills'] + data['roadKills']
data['KILL']
Out[21]:
0       0
1       1
2       1
3       0
4       0
       ..
9995    0
9996    0
9997    0
9998    0
9999    0
Name: KILL, Length: 10000, dtype: int64
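
An alternative that scales better when more columns are involved is a row-wise `sum` over a list of column names. The sketch below uses a made-up frame (`toy` and `kill_cols` are illustrative names). The two approaches agree when no values are missing, as is the case for these columns; with missing values, `sum` skips them by default whereas `+` propagates NaN.

```python
import pandas as pd

# Illustrative stand-in; the values are invented for the example.
toy = pd.DataFrame({
    "headshotKills": [0, 1, 2],
    "teamKills": [0, 0, 1],
    "roadKills": [0, 1, 0],
})

kill_cols = ["headshotKills", "teamKills", "roadKills"]

# Row-wise sum over the listed columns.
toy["KILL"] = toy[kill_cols].sum(axis=1)
print(toy["KILL"].tolist())
```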

21. Round off column ‘winPlacePerc’ to 2 decimals.

In [22]:
data['winPlacePerc'] = data['winPlacePerc'].round(2)
data['winPlacePerc']
Out[22]:
0       0.00
1       0.22
2       0.86
3       0.35
4       0.07
        ... 
9995    0.83
9996    0.72
9997    0.21
9998    0.24
9999    0.19
Name: winPlacePerc, Length: 10000, dtype: float64

22. Draw a sample of size 50 from the column damageDealt 100 times and calculate the mean of each sample. Plot the means on a histogram and comment on their distribution.

In [35]:
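sample_means = []
for _ in range(100):
    # mean of one random sample of 50 damageDealt values
    sample_means.append(data['damageDealt'].sample(50).mean())

# distplot is deprecated in seaborn; histplot with a KDE overlay gives a comparable plot
sns.histplot(sample_means, kde=True);
plt.xlabel('Mean damageDealt (samples of 50)');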
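
For the comment on the distribution, the central limit theorem is the relevant reference point: even though damageDealt itself is right-skewed, the sample means should be roughly bell-shaped, centred near the overall mean (about 129.2 in the summary table) with a spread near the standard error, 167.2 / sqrt(50) ≈ 23.6. The sketch below checks this behaviour on simulated data; the exponential `population` is an assumption standing in for `data['damageDealt']`, and the seed is fixed only to make the run repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Made-up, right-skewed values standing in for data['damageDealt'].
population = rng.exponential(scale=130.0, size=10_000)

n, repeats = 50, 100
sample_means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(repeats)
])

# Spread of the sample means vs. the standard error sigma / sqrt(n).
observed_spread = sample_means.std(ddof=1)
expected_spread = population.std() / np.sqrt(n)
print(observed_spread, expected_spread)
```

With only 100 repetitions the histogram will still look somewhat ragged; increasing `repeats` smooths it without changing its centre or expected spread.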